From the data set provided below, we are going to train a classification model to accurately predict good and bad credit risks. Obviously, misclassifying a credit risk as good when it is actually bad is worse than the opposite, i.e. misclassifying a credit risk as bad when it is actually good. In this example, this behavior is quantified by the cost matrix provided below:
| Actual \ Predicted | Good ('1') [P] | Bad ('2') [N] |
|---|---|---|
| Good ('1') [P] | 0 [TP] | 1 [FN] |
| Bad ('2') [N] | 5 [FP] | 0 [TN] |

Rows: actual classification; columns: predicted classification ('1' = Good Credit Risk, '2' = Bad Credit Risk).

Note that we consider here the Good Credit Risk ('1') as the positive class [P].
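To make the cost structure concrete, here is a minimal sketch that scores a set of predictions against this cost matrix; it assumes the integer class labels 1 (Good) and 2 (Bad) used later in the notebook, and the helper name is hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Cost matrix: rows = actual (Good, Bad), columns = predicted (Good, Bad)
COST = np.array([[0, 1],   # actual Good: TP costs 0, FN costs 1
                 [5, 0]])  # actual Bad:  FP costs 5, TN costs 0

def total_misclassification_cost(y_true, y_pred):
    """Sum the element-wise product of the confusion matrix and the cost matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
    return int((cm * COST).sum())

total_misclassification_cost([1, 2, 2], [1, 1, 2])  # one Bad predicted as Good -> cost 5
```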
The data set 'credit_data.csv' contains 20 attributes of personal, loan, and credit history details, which are used to associate a customer with a good ('1') or a bad ('2') credit risk. The file is in comma-separated format and contains the variables below:
Attribute 1: Status of existing checking account [string type, as ordinal categorical (in ascending order)]
- 'none': no checking account
- 'overdraft': less than 0 EUR
- 'adequate': in [0, 200) EUR
- 'good': greater/equal than 200 EUR / salary assignments for at least 1 year

Attribute 2: Duration in months (numerical)

Attribute 3: Credit history [string type, as ordinal categorical (in ascending order)]
- 'none': no credits taken / all credits paid back duly
- 'Paid-Off': all credits at this bank paid back duly
- 'Serviced': existing credits paid back duly till now
- 'Delayed': delay in paying off in the past
- 'Critical': critical account / other credits existing (not at this bank)

Attribute 4: Loan purpose (string type, as nominal categorical)
- 'car (new)', 'car (used)', 'furniture/equipment', 'radio/television', 'domestic appliances', 'repairs', 'education', 'vacation', 'retraining', 'business', 'others'

Attribute 5: Credit amount (EUR) (float type, numerical)

Attribute 6: Savings in account/bonds (EUR) (float type, numerical)

Attribute 7: Consecutive employment (years) [string type, as ordinal categorical (in ascending order)]
- 'E0': unemployed
- 'E1': less than 1 year
- 'E4': in [1, 4) years
- 'E7': in [4, 7) years
- 'E7+': greater/equal 7 years

Attribute 8: Installment rate in percentage of disposable income [float type, numerical (percentage)]

Attribute 9: Personal status and sex (string type, as nominal categorical)
- 'M0': male & single
- 'M1': male & married/widowed
- 'M2': male & divorced/separated
- 'F0': female & single
- 'F1': female & divorced/separated/married

Attribute 10: Other debtors / guarantors (string type, as nominal categorical)
- 'co-applicant', 'guarantor', 'none'

Attribute 11: Present residence since [integer type, numerical (years)]

Attribute 12: Property (string type, as nominal categorical)
- 'real estate', 'building society savings agreement / life-insurance', 'car / other', 'unknown / no-property'

Attribute 13: Age in years (integer type, numerical)

Attribute 14: Other installment plans (string type, as nominal categorical)
- 'bank', 'stores', 'none'

Attribute 15: Housing [string type, as ordinal categorical (in ascending order)]
- 'guest', 'rent', 'own'

Attribute 16: Number of existing credits at this bank (integer type, numerical)

Attribute 17: Work experience [string type, as ordinal categorical (in descending order)]
- 'A': management / self-employed / highly qualified employee / officer
- 'B': skilled employee / official
- 'C': unskilled - resident
- 'D': unemployed / unskilled - non-resident

Attribute 18: Number of people being liable to provide maintenance for (integer type, numerical)

Attribute 19: Telephone exists (string type, as nominal categorical)
- 'yes' (registered under the customer's name)
- 'no'

Attribute 20: Foreign worker (string type, as nominal categorical)
- 'yes', 'no'
In [1]:
# Required Libraries
from __future__ import print_function  # must be the first statement in the cell

import os
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score, zero_one_loss
from sklearn.preprocessing import LabelEncoder
from pprint import pprint
from time import time
from matplotlib import pyplot as plt
import matplotlib as mlb
import seaborn as sns

np.set_printoptions(precision=4, suppress=True)
In [2]:
# Matplotlib/Seaborn Parameters
mlb.style.use('seaborn-whitegrid')
sns.set_context('talk')
%matplotlib inline
mlb.rcParams['figure.titleweight'] = 'bold'
mlb.rcParams['axes.labelweight'] = 'bold'
mlb.rcParams['axes.titleweight'] = 'bold'
mlb.rcParams['figure.figsize'] = [10,6]
In [3]:
# User-defined Functions [Loaded from: /media/ML_HOME/ML-Code_Base (through a .pth file)]
from visualization_helper_functions import freq_tables
from ML_helper_functions import cv_estimate
In [4]:
# Path Definitions of Required Data Sets
credit_data_path = os.path.join('/media/ML_HOME/ML-Data_Repository/', 'credit_data.csv')
In [5]:
# Load the Credit Data Set, "credit_data.csv"
credit_data_df = pd.read_csv(credit_data_path)
In [6]:
credit_data_df.head()
Out[6]:
In [7]:
# Change the attribute names to shorter ones for our convenience
colnames = ['checking_acc_status',
'loan_duration',
'credit_hist',
'loan_purpose',
'credit_amount',
'savings',
'consecutive_emp',
'installment_rate',
'sex_personal_status',
'other_debtors_guarantors',
'present_residence_since',
'property_type',
'age',
'other_installment_plans',
'housing_type',
'credits_at_bank',
'work_experience',
'num_people_liable',
'phone_exist',
'foreign_worker',
'credit_risk']
new_columns = {key:value for key, value in zip(list(credit_data_df.columns), colnames)}
credit_data_df.rename(columns=new_columns, inplace=True)
In [8]:
credit_data_df.dtypes
Out[8]:
In [9]:
credit_data_df.info()
As a next step, we convert the categorical variables to the appropriate data types, and specify the category order for those that are better described as ordinal.
In [10]:
credit_data_df.loc[:,'checking_acc_status'] = (credit_data_df.loc[:,'checking_acc_status']
.astype('category', categories=['none', 'overdraft', 'adequate', 'good'],
ordered=True))
credit_data_df.loc[:,'credit_hist'] = (credit_data_df.loc[:,'credit_hist']
.astype('category', categories=['none', 'Paid-Off', 'Serviced', 'Delayed', 'Critical'],
ordered=True))
credit_data_df.loc[:,'loan_purpose'] = (credit_data_df.loc[:,'loan_purpose']
.astype('category', ordered=False))
credit_data_df.loc[:,'consecutive_emp'] = (credit_data_df.loc[:,'consecutive_emp']
.astype('category', categories=['E0', 'E1', 'E4', 'E7', 'E7+'],
ordered=True))
credit_data_df.loc[:,'sex_personal_status'] = (credit_data_df.loc[:,'sex_personal_status']
.astype('category', ordered=False))
credit_data_df.loc[:,'other_debtors_guarantors'] = (credit_data_df.loc[:,'other_debtors_guarantors']
.astype('category',ordered=False))
credit_data_df.loc[:,'property_type'] = (credit_data_df.loc[:,'property_type']
.astype('category', ordered=False))
credit_data_df.loc[:,'other_installment_plans'] = (credit_data_df.loc[:,'other_installment_plans']
.astype('category', ordered=False))
credit_data_df.loc[:,'housing_type'] = (credit_data_df.loc[:,'housing_type']
.astype('category', categories=['guest', 'rent', 'own'], ordered=True))
credit_data_df.loc[:,'work_experience'] = (credit_data_df.loc[:,'work_experience']
.astype('category', categories=['D', 'C', 'B', 'A'], ordered=True))
credit_data_df.loc[:,'phone_exist'] = (credit_data_df.loc[:,'phone_exist']
.astype('category', ordered=False))
credit_data_df.loc[:,'foreign_worker'] = (credit_data_df.loc[:,'foreign_worker']
.astype('category', ordered=False))
credit_data_df.loc[:,'credit_risk'] = credit_data_df.loc[:,'credit_risk'].astype('str')
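A side note on portability: the `.astype('category', categories=..., ordered=True)` signature used above was deprecated and eventually removed in newer pandas releases. Under a recent pandas, the same ordered conversion would be written with `CategoricalDtype`; a sketch for one column:

```python
from pandas.api.types import CategoricalDtype

# Equivalent ordered conversion for newer pandas versions (shown for one column)
checking_dtype = CategoricalDtype(categories=['none', 'overdraft', 'adequate', 'good'],
                                  ordered=True)
credit_data_df['checking_acc_status'] = credit_data_df['checking_acc_status'].astype(checking_dtype)
```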
In [11]:
credit_data_df.dtypes
Out[11]:
In [12]:
sns.countplot(x='credit_risk', data=credit_data_df)
plt.title('Classes of Credit Risk\n[\'credit_data_df\']')
plt.show()
From the frequency tables below, showing how the values of each categorical variable are distributed among the two classes of credit risk, we can safely assume that all these attributes are important for the predictive model we are going to build. Of course, some attributes will be more important than others, but we had better leave the machine learning algorithm to decide on that.
In [13]:
categ_attribs = list(credit_data_df.select_dtypes(include=['category']).columns)
In [14]:
# Frequency tables and barplots of categorical attributes
# to depict their distribution among the two classes of credit risk
freq_tables(credit_data_df, categ_attribs, 'credit_risk',
barplot=True, matplotlib_style='seaborn-whitegrid')
In [15]:
# Keep the important categorical attributes in a separate list
imp_categ_attribs = categ_attribs
imp_categ_attribs
Out[15]:
From the boxplot diagrams below, depicting the distribution of the integer variables across the two classes of credit risk, we can safely exclude the following attributes from our consideration:
- 'present_residence_since'
- 'credits_at_bank'
- 'num_people_liable'

These attributes do not seem to be influential for the response variable, 'credit_risk', that we want to predict.
In [16]:
int_attribs = list(credit_data_df.select_dtypes(include=['int']).columns)
In [17]:
credit_data_df[int_attribs].describe()
Out[17]:
In [18]:
# Plot the Boxplot of all the int_attribs vs the "credit_risk" class ("1"/"2")
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(14, 6*3))
axes_coord = [(i,j) for i in range(3) for j in range(2)]
for attrib in int_attribs:
sns.boxplot(x='credit_risk', y=attrib, data=credit_data_df, ax=axes[axes_coord.pop(0)])
fig.tight_layout(h_pad=2, w_pad=2, rect=[0, 0, 0.93, 0.93])
fig.suptitle('\"Integer Attributes values\" vs \"Credit Risk Class\"',
fontsize=16, fontweight='bold')
plt.show()
In [19]:
# Keep the important integer attributes in a separate list
imp_int_attribs = list(int_attribs)  # copy, so "int_attribs" itself stays intact
imp_int_attribs.remove('present_residence_since')
imp_int_attribs.remove('credits_at_bank')
imp_int_attribs.remove('num_people_liable')
imp_int_attribs
Out[19]:
From the boxplot diagrams below, and the frequency tables concerning the distribution of the 'installment_rate' and 'savings' values across the two classes of credit risk, we can safely assume that both these attributes are important. Furthermore, we had better discretize the 'savings' values into quartiles as shown below, and use the attribute in this form to train the machine learning algorithm of choice.
In [20]:
float_attribs = list(credit_data_df.select_dtypes(include=['float']).columns)
In [21]:
# Plot the Boxplot of all the "float_attribs" vs the "credit_risk" class ("1"/"2")
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(14, 6))
axes_coord = [(i) for i in range(3)]
for attrib in float_attribs:
sns.boxplot(x='credit_risk', y=attrib, data=credit_data_df, ax=axes[axes_coord.pop(0)])
fig.tight_layout(h_pad=2, w_pad=2, rect=[0, 0, 0.93, 0.93])
fig.suptitle('\"Float Attributes values\" vs \"Credit Risk Class\"',
fontsize=16, fontweight='bold')
plt.show()
In [22]:
freq_tables(credit_data_df, ['installment_rate'], 'credit_risk',
barplot=True, matplotlib_style='seaborn-whitegrid')
In [23]:
credit_data_df['savings_factor'], bins = pd.qcut(credit_data_df['savings'],
[0, .25, .50, .75, 1], retbins=True,
labels=['Q1', 'Q2', 'Q3', 'Q4'])
print('1st-quartile: [%.2f, %.2f] EUR in \'savings\'' % (bins[0], bins[1]))
print('2nd-quartile: (%.2f, %.2f] EUR in \'savings\'' % (bins[1], bins[2]))
print('3rd-quartile: (%.2f, %.2f] EUR in \'savings\'' % (bins[2], bins[3]))
print('4th-quartile: (%.2f, %.2f] EUR in \'savings\'' % (bins[3], bins[4]))
In [24]:
freq_tables(credit_data_df, ['savings_factor'], 'credit_risk',
barplot=True, matplotlib_style='seaborn-whitegrid')
In [25]:
# Keep the important float attributes in a separate list
imp_float_attribs = list(float_attribs)  # copy, so "float_attribs" itself stays intact
imp_float_attribs.remove('savings')
imp_float_attribs
Out[25]:
In [26]:
# Update the important categorical attributes with the new "savings_factor"
imp_categ_attribs = list(credit_data_df.select_dtypes(include=['category']).columns)
From the discussion above, we conclude that the useful part of the provided data set has only two integer variables, i.e. the 'loan_duration' and the 'age' of the customer, and two float attributes, i.e. the 'credit_amount' and the 'installment_rate', whereas all the other attributes are nominal or ordinal categorical variables. The business question we have to answer, on the other hand, namely to accurately predict the credit risk class of a client (represented here by a record line), is a simple, binary classification problem. Therefore, decision tree classification algorithms, or ensembles built on top of them (e.g. Random Forest or Extremely Randomized Trees), are expected to provide high predictive performance for this kind of task.
In [27]:
response = 'credit_risk'
imp_attribs = imp_categ_attribs + imp_int_attribs + imp_float_attribs
In [28]:
credit_data_df[imp_attribs + [response]].head()
Out[28]:
"credit_data_df"
: "train"
& "test"
partNext, we prepare the provided data set for the scikit-learn classifiers. This requires to appropriately encode the categorical variables and split the provided data into two parts, a training and a test set. Note, that due to the class imbalance of the problem, we apply the StratifiedKFold
method to take this fact into account. In addition, by choosing the 'shuffle=True'
flag, we eliminate the possibility of a data ordering that is not arbitrary (e.g. samples with the same class label being contiguous). The samples are independently and identically distributed in this case, which makes shuffling an additional possibility to get a meaningful cross-validation result.
In [29]:
RANDOM_STATE = 1234
# Encode the Categorical Variables
le = LabelEncoder()
credit_data_df1 = credit_data_df[imp_attribs + [response]].copy()
for attrib in imp_categ_attribs:
credit_data_df1.loc[:,attrib] = le.fit_transform(credit_data_df1[attrib].values)
# Split the Attribute from the Response variables
X = credit_data_df1[imp_attribs]
y = credit_data_df1[response].astype(int)
# Split the provided data set in a "train" and a "test" part
# (2-fold stratified split; the loop below keeps the last fold as the final split)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=RANDOM_STATE)
for train_ix, test_ix in skf.split(X, y):
    X_train, X_test = X.iloc[train_ix, :], X.iloc[test_ix, :]
    y_train, y_test = y.iloc[train_ix], y.iloc[test_ix]
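For reference, an equivalent stratified 50/50 split can also be obtained in a single call with `train_test_split`; a minimal alternative sketch (not what this notebook uses):

```python
from sklearn.model_selection import train_test_split

# Stratified 50/50 split in one call; equivalent in spirit to keeping
# one fold of the 2-fold StratifiedKFold above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=RANDOM_STATE)
```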
"dtree"
)It is important to note that the provided cost matrix:
| Actual \ Predicted | Good ('1') [P] | Bad ('2') [N] |
|---|---|---|
| Good ('1') [P] | 0 [TP] | 1 [FN] |
| Bad ('2') [N] | 5 [FP] | 0 [TN] |
which can help us better quantify the misclassifications of Good ('1') and Bad ('2') Credit Risks, can be nicely realized by the so-called $F_{\beta}$ score for $\beta=\frac{1}{\sqrt{5}}$, where

$$ \mathit{precision} = \frac{TP}{TP + FP}\,,\qquad \mathit{recall}=\frac{TP}{TP + FN}\,. $$

Indeed, since we consider here the Good Credit Risk ('1') as the positive class [P], the provided cost matrix calls for a $\mathit{precision}$ function that should be weighted five times higher than the $\mathit{recall}$ function. To see why this is true for $F_{\beta=1/\sqrt{5}}$, recall that the F-measure, $F_{\beta}$, is related to Van Rijsbergen's effectiveness measure, $E$, as below:

$$ F_{\beta} = 1 - E\,, $$

where

$$ E = 1 - \left(\frac{\alpha}{\mathit{precision}} + \frac{1-\alpha}{\mathit{recall}}\right)^{-1}\,,\qquad \alpha=\frac{1}{1+\beta^{2}}\,. $$

For $\beta^{2}=1/5$ this gives $\alpha = 1/(1+\beta^{2}) = 5/6$, so $\mathit{precision}$ carries weight $5/6$ and $\mathit{recall}$ weight $1/6$, i.e. exactly the 5:1 ratio demanded by the cost matrix.
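As a quick numerical sanity check of this claim, the sketch below (with illustrative labels only) verifies that scikit-learn's `fbeta_score` with $\beta = 1/\sqrt{5}$ coincides with the weighted harmonic mean of precision and recall for $\alpha = 5/6$:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 2, 2, 1, 2, 1])  # illustrative labels only
y_pred = np.array([1, 2, 1, 1, 2, 1, 2, 1])
beta = 1 / np.sqrt(5)
p = precision_score(y_true, y_pred, pos_label=1)
r = recall_score(y_true, y_pred, pos_label=1)
alpha = 1 / (1 + beta ** 2)                    # = 5/6: precision weighted 5x recall
f_manual = 1 / (alpha / p + (1 - alpha) / r)   # weighted harmonic mean = 1 - E
f_sklearn = fbeta_score(y_true, y_pred, beta=beta, pos_label=1)
assert np.isclose(f_manual, f_sklearn)
```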
In [30]:
RANDOM_STATE = 345
# Instantiate a simple DTree to fit
dtree = tree.DecisionTreeClassifier(class_weight='balanced',
random_state=RANDOM_STATE)
# Determine a Parameter Grid
# to better investigate their adequacy for the DTree defined above
param_grid = {'max_depth': [3, 4, 5, 6],
'min_samples_leaf': [5, 4, 3, 2, 1],
'max_features': [None, 'sqrt']}
# Define the scoring function of interest
betaSq = 1./5  # float literal, so the ratio also survives Python 2 integer division
fb_score = make_scorer(fbeta_score, beta=np.sqrt(betaSq), pos_label=1, average='binary')
# Instantiate the "GridSearchCV" method
grid0 = GridSearchCV(dtree, param_grid, scoring=fb_score, cv=5, iid=True)
# Perform the Parameter Grid Search
grid0.fit(X_train, y_train)
# Return the results found
print('Best Estimator found by "GridSearchCV":\n\n%s' % grid0.best_estimator_)
print('\nBest Parameters found by "GridSearchCV":\n\n%s' % grid0.best_params_)
print('\nBest "F_beta Score" (beta = 1/sqrt{5}) found by "GridSearchCV [train set]": %.4f' % grid0.best_score_)
# Provide the F_beta score using the test case
y_pred = grid0.predict(X_test)
fbeta_score_test = fbeta_score(y_test, y_pred, beta=np.sqrt(betaSq), pos_label=1, average='binary')
print('\n"F_beta Score" (beta = 1/sqrt{5}) [test set]: %.4f' % fbeta_score_test)
# Export a Graphical Visualization of the fitted "dtree" by the "GridSearchCV" method
import pydotplus
from IPython.display import Image
dot_data = tree.export_graphviz(grid0.best_estimator_, out_file=None,
                                feature_names=imp_attribs,
                                class_names=['Good (1)', 'Bad (2)'],  # ascending order of class labels
                                rotate=True, filled=True, rounded=True, special_characters=False)
graph = pydotplus.graph_from_dot_data(dot_data)
#graph.write_pdf('./credit_data_DT.pdf')
Image(graph.create_png())
Out[30]:
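Note that the scorer is not the only place where the cost asymmetry can live: the 5:1 cost ratio could alternatively be pushed into the classifier itself through explicit class weights. A sketch of that variant, relying on the names defined above (not what this notebook does, which combines `class_weight='balanced'` with the $F_{\beta}$ scorer):

```python
# Alternative sketch: weight the Bad class ('2') five times higher, mirroring
# the cost matrix, instead of 'balanced' weights plus the F_beta scorer
dtree_cost = tree.DecisionTreeClassifier(class_weight={1: 1, 2: 5},
                                         random_state=RANDOM_STATE)
```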
First, we compare which averaging ensemble method performs best, and find the number of estimators required to reach a plateau of the OOB error rate.
In [31]:
from collections import OrderedDict
RANDOM_STATE = 345
# Range of "n_estimators" values to explore.
min_estimators = 30
max_estimators = 500
# Instantiate the Ensemble Classifiers of interest
ensemble_clfs = [("RandomForestClassifier, max_features='sqrt'",
RandomForestClassifier(max_features='sqrt',
max_depth=None, min_samples_leaf=1,
warm_start=True, bootstrap=True, oob_score=True,
random_state=RANDOM_STATE)),
("ExtraTreesClassifier, max_features='sqrt'",
ExtraTreesClassifier(max_features='sqrt',
max_depth=None, min_samples_leaf=1,
warm_start=True, bootstrap=True, oob_score=True,
random_state=RANDOM_STATE))]
# Map a classifier name to a list of (<n_estimators>, <error rate>) pairs.
error_rate = OrderedDict((label, []) for label, _ in ensemble_clfs)
for label, clf in ensemble_clfs:
s = '\nTraining: %s' % label
print(s)
print('-' * (len(s) + 3))
for i in range(min_estimators, max_estimators + 1):
clf.set_params(n_estimators=i)
clf.fit(X_train, y_train)
# Record the OOB Error Rate returned
# by Classifier for each "n_estimators=i" setting
oob_error = 1 - clf.oob_score_
error_rate[label].append((i, oob_error))
# Generate the "OOB error rate" vs. "n_estimators" plot
plt.figure(figsize=(12, 7.5))
ys_min = 1; ys_max = 0
for label, clf_err in error_rate.items():
xs, ys = zip(*clf_err)
clf_err_nd = np.array(clf_err)
label_ys_min = clf_err_nd[:,1].min(); label_ys_max = clf_err_nd[:,1].max()
label_xs0 = clf_err_nd[clf_err_nd[:,1].argmin(),0]
if label_ys_min < ys_min:
ys_min = min([ys_min, label_ys_min])
xs0 = label_xs0
ys_max = max([ys_max, label_ys_max])
plt.plot(xs, ys, '-', label=label)
plt.axvline(x=xs0, color='r', linestyle='dashed',
label=r'(n_estimators=%d, OOB Error=%.2f)' % (xs0, ys_min))
plt.xlim(0, max_estimators)
plt.xlabel("n_estimators")
plt.ylabel("OOB Error Rate")
plt.title('OOB Error Rates (train set)\n[Random Forest & Extra Trees Classifiers (max_features=\'sqrt\')]')
plt.legend(loc="upper right", prop={'weight':'bold'})
plt.show()
Next, we tune the averaging ensemble method found to be best by the OOB Error Rate plots above, the Random Forest Classifier. Note that an improved $F_{\beta}$ score over the test set is now achieved, i.e. $F_{\beta=1/\sqrt{5}}~[{test~set}]=0.7748$, compared with the $F_{\beta=1/\sqrt{5}}~[{test~set}] = 0.7704$ we obtained earlier by appropriately fitting a simple decision tree.
In [32]:
# Tune the averaging ensemble method
# that was found to be best in the discussion above
label, clf = ensemble_clfs[0]
n_estimators = 400
clf.set_params(n_estimators=n_estimators,
               max_features='sqrt')
print('Tuning the averaging ensemble method found to be best by the OOB Error Rate plots...\n')
print(clf)
# Parameter Grid to better investigate for their adequacy
param_grid = {'max_depth': [3, 4, 5, 6, None],
'min_samples_leaf': [5, 4, 3, 2, 1]}
# Define the scoring function of interest
betaSq = 1./5  # float literal, see the note at the first definition above
fb_score = make_scorer(fbeta_score, beta=np.sqrt(betaSq), pos_label=1, average='binary')
# Instantiate the "GridSearchCV" method
grid1 = GridSearchCV(clf, param_grid, scoring=fb_score, cv=5, iid=True)
# Perform the Parameter Grid Search
grid1.fit(X_train, y_train)
# Return the results found
print('\nBest Estimator found by "GridSearchCV":\n\n%s' % grid1.best_estimator_)
print('\nBest Parameters found by "GridSearchCV":\n%s' % grid1.best_params_)
print('\nBest "F_beta Score" (beta = 1/sqrt{5}) found by "GridSearchCV [train set]": %.4f' % grid1.best_score_)
# Provide the F_beta score using the test case
y_pred = grid1.predict(X_test)
fbeta_score_test = fbeta_score(y_test, y_pred, beta=np.sqrt(betaSq), pos_label=1, average='binary')
print('\n"F_beta Score" (beta = 1/sqrt{5}) [test set]: %.4f' % fbeta_score_test)
The feature importances are also depicted in the barplot diagram below. The amount that has been credited to a customer, 'credit_amount', is the most important attribute for the predictive model we want to build, with her 'age', the 'loan_duration', the status of her checking account, 'checking_acc_status', and the credit history she has so far, 'credit_hist', coming immediately after. Less, but still, important are the attributes that have to do with the years of consecutive employment, 'consecutive_emp', the 'loan_purpose', its 'installment_rate', the 'property_type', the savings that currently exist in the customer's account, 'savings_factor', her 'sex_personal_status', her 'work_experience', and finally whether she owns, rents, or is a guest in a property, i.e. 'housing_type'. Note, however, that whether a phone exists ('phone_exist'), and/or whether the customer is a foreign worker or not ('foreign_worker'), do not seem to be important.
In [33]:
# Store the feature importances in a separate list
importances = grid1.best_estimator_.feature_importances_
# Compute the standard deviation of feature importances across all the "n_estimators"
# ("est" as loop variable avoids shadowing the imported "tree" module)
std = np.std([est.feature_importances_ for est in grid1.best_estimator_.estimators_], axis=0)
# Sort in a descending order of importance the feature indices
importances_idx = np.argsort(importances)[::-1]
# Print the Feature Importance
print('Feature Importance:')
for f in range(X_train.shape[1]):
print("%d. feature %s (%f)" %
(f + 1, imp_attribs[importances_idx[f]], importances[importances_idx[f]]))
# Plot the feature importances of the forest
plt.figure(figsize=(12, 7.5))
plt.bar(range(X_train.shape[1]), importances[importances_idx],
color='r', yerr=std[importances_idx], align='center')
plt.xticks(range(X_train.shape[1]),
[imp_attribs[i] for i in importances_idx], fontweight='bold',
rotation='vertical')
plt.yticks(fontweight='bold')
plt.xlim([-1, X_train.shape[1]])
plt.title('Feature importances')
plt.show()
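Keep in mind that these impurity-based importances are known to be biased towards high-cardinality features. On a newer scikit-learn (0.22+, more recent than the version this notebook appears to target), a cross-check with permutation importance could look like the following sketch:

```python
# Assumes scikit-learn >= 0.22 (newer than this notebook's environment)
from sklearn.inspection import permutation_importance

perm = permutation_importance(grid1.best_estimator_, X_test, y_test,
                              scoring=fb_score, n_repeats=10,
                              random_state=RANDOM_STATE)
for i in perm.importances_mean.argsort()[::-1]:
    print('%s: %.4f +/- %.4f' % (imp_attribs[i],
                                 perm.importances_mean[i],
                                 perm.importances_std[i]))
```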
In this section, we investigate whether an additional step, applying a second machine learning algorithm on the training set that remains after filtering out the unimportant features found above, can improve the predictive performance any further.
More specifically, we are going to train two boosting algorithms, AdaBoost and Gradient Boosting, using the training set that remains after the feature selection suggested by the grid1.best_estimator_:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='sqrt', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=400, n_jobs=1, oob_score=True, random_state=345,
verbose=0, warm_start=True)
To do so, we apply the sklearn.feature_selection.SelectFromModel method at various thresholds of interest, namely '0.58 * median' (returning the attributes with importances greater than or equal to the first quartile value, Q1), 'median', 'mean', and '1.58 * median', which selects the attributes with importances greater than or equal to the Q3 value.
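Before running the full comparison, a quick way to see which attributes survive a given threshold is to query a fitted selector directly; a minimal sketch using the names defined above:

```python
# Inspect which attributes survive the '0.58 * median' importance threshold
model = SelectFromModel(grid1.best_estimator_, threshold='0.58 * median', prefit=True)
kept = [attr for attr, keep in zip(imp_attribs, model.get_support()) if keep]
print(kept)
```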
The results of our investigation are depicted in the composite plot shown below. We trained the AdaBoost and the Gradient Boosting Classifier over 700 and 200 'n_estimators' respectively, having slightly altered the default parameters of the base estimators used by these algorithms, in order to minimize the returned cross-validation error. We also altered the default 'learning_rate' in both these calls, and chose 'subsample=0.5' to apply bagging in the second case.
We found that fitting a Gradient Boosting Classifier as the next step of a machine learning pipeline returns much better results, as far as the CV error rate is concerned. More specifically, the best results were achieved for 'n_estimators=130' and the '0.58 * median' threshold of the feature selection procedure, SelectFromModel.
In [34]:
RANDOM_STATE = 345
# Max number of "n_estimators"
n_estimators0 = [700, # AdaBoostClassifier
200] # GradientBoostingClassifier
# Thresholds values to use for feature selection
thresholds = ['0.58 * median', # RF.feature_importances_ >= Q1
'median',
'mean',
'1.58 * median'] # RF.feature_importances_ >= Q3
# Instantiate the basic DT_stump classifier
dt_stump = tree.DecisionTreeClassifier(max_depth=1,
max_features=None,
min_samples_leaf=6,
class_weight='balanced',
random_state=RANDOM_STATE)
# Instantiate the Ensemble Boosting Classifiers of interest
boosting_clfs = [("AdaBoostClassifier, dt_stump (max_depth=1, min_samples_leaf=6)",
AdaBoostClassifier(base_estimator=dt_stump,
learning_rate=0.6,
random_state=RANDOM_STATE)),
("GradientBoostingClassifier, (base estimator: max_depth=3, min_samples_leaf=1)",
GradientBoostingClassifier(loss='deviance',
learning_rate=0.03,
max_features=None,
subsample=0.5, # USE bagging FOR BETTER PERFORMANCE
max_depth=3, # default [basic estimator]
min_samples_leaf=1, # default [basic estimator]
random_state=RANDOM_STATE))]
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(14,12)); axes = list(axes)
for label, clf in boosting_clfs:
ax = axes.pop(0)
sns_colors = list(sns.color_palette())
n_estimators = n_estimators0.pop(0)
cv_best_err = 1; cv_best_iter = 0
for thr in thresholds:
        s1 = '\nTraining: %s' % label
print(s1)
print('-' * (len(s1) + 3))
print(clf)
print('\n[using Feature Selection (at threshold: %s)]' % thr)
# Choose plotting color for the current Training/Test Errors
ax_col = sns_colors.pop(0)
# FEATURE SELECTION USING THE PREVIOUSLY FOUND "best_estimator"
model = SelectFromModel(grid1.best_estimator_,
threshold=thr, prefit=True)
X_train_stg1 = model.transform(X_train)
X_test_stg1 = model.transform(X_test)
        y_train = np.array(y_train); y_test = np.array(y_test)
# FIT THE "clf" BOOSTING CLASSIFIER ON THE SELECTED DATA SET
clf.set_params(n_estimators=n_estimators)
clf.fit(X_train_stg1, y_train)
# COMPUTE THE TRAINING ERROR MADE BY THE "clf" BOOSTING CLASSIFIER
x = np.arange(n_estimators)
train_err = np.zeros((n_estimators,))
for i, y_pred in enumerate(clf.staged_predict(X_train_stg1)):
train_err[i] = zero_one_loss(y_train, y_pred)
# COMPUTE THE CV ERROR MADE BY THE "clf" BOOSTING CLASSIFIER
cv_err = cv_estimate(clf, X_train_stg1, y_train, n_estimators=n_estimators, n_splits=3,
shuffle=False, random_state=RANDOM_STATE)
if np.min(cv_err) < cv_best_err:
cv_best_label = thr
cv_best_ax_col = ax_col
cv_best_err = np.min(cv_err)
cv_best_iter = x[np.argmin(cv_err)]
# PRODUCE THE TRAINING/CV-ERROR RATES
linesalpha = 0.7
ax.plot(x, train_err, color=ax_col, alpha=linesalpha,
linestyle='dashed', label= 'Train Error [threshold: %s]' % thr)
ax.plot(x, cv_err, color=ax_col, alpha=linesalpha,
label= 'CV Error [threshold: %s]' % thr)
# SUB-FIGURE POLISHING
ax.axvline(x=cv_best_iter, color=cv_best_ax_col, linestyle='dashed', linewidth=3.0,
label='(n_estimators=%d, CV Error=%.3f)\n[threshold: %s]' % (cv_best_iter, cv_best_err, cv_best_label))
leg = ax.legend(loc = 'lower left', prop = {'weight': 'bold'})
leg.set_title('Training/CV-Errors\n[Features Selected by RF Classifier at thresholds]:',
prop={'size': 14, 'weight':'bold'})
ax.set_xlabel('n_estimators')
ax.set_ylabel('Error Rate')
ax.set_title('\n'.join(label.split(sep=',', maxsplit=1)))
# Figure polishing
fig.suptitle('Training / Cross-Validation Error Rates', fontsize=16, fontweight='bold')
fig.tight_layout(h_pad=2, w_pad=2, rect=[0,0,0.96,0.96])
plt.show()
In [35]:
# Tune the 2nd-stage ML Classifier [Gradient Boosting Classifier]
# that was found to be best in the discussion above
label, clf = boosting_clfs[1]
n_estimators = 1500 # to give room for the Classifier to be trained
clf.set_params(n_estimators=n_estimators)
# ML Pipeline
pipe = Pipeline([('feature_selection',
SelectFromModel(grid1.best_estimator_, threshold='0.58*median', prefit=False)),
('GradientBoostingClassifier', clf)])
print('Tuning the boosting ensemble method found to be best by the CV-Error rate plots...\n')
print(clf)
# Parameter Grid to better investigate for their adequacy
param_grid = {'GradientBoostingClassifier__max_depth': [1, 2, 3, 4],
'GradientBoostingClassifier__min_samples_leaf': [1, 2, 3, 4, 5, 6],
'GradientBoostingClassifier__learning_rate': [0.01, 0.03, 0.06]}
# Define the scoring function of interest
betaSq = 1./5  # float literal, see the note at the first definition above
fb_score = make_scorer(fbeta_score, beta=np.sqrt(betaSq), pos_label=1, average='binary')
# Instantiate the "GridSearchCV" method
grid2 = GridSearchCV(pipe, param_grid, scoring=fb_score, cv=5, iid=True)
# Perform the Parameter Grid Search
print('\nPerforming grid search...')
print('\nPipeline:', [name for name, _ in pipe.steps])
print('\nParameters:')
pprint(param_grid)
t0 = time()
grid2.fit(X_train, y_train)
print('\nCompleted in %0.3fsec\n' % (time() - t0))
# Return the results found
print('\nBest Parameters found by "GridSearchCV":')
pprint(grid2.best_params_)
print('\nBest "F_beta Score" (beta = 1/sqrt{5}) found by "GridSearchCV [train set]": %.4f' % grid2.best_score_)
Finally, we provide predictions over the held-out test set (X_test), using the best estimator returned by the procedure above. The $F_{\beta=1/\sqrt{5}}$ score achieved now is significantly higher than before, as the output below shows.
In [36]:
# Provide the F_beta score using the test case
y_pred = grid2.predict(X_test)
fbeta_score_test = fbeta_score(y_test, y_pred, beta=np.sqrt(betaSq), pos_label=1, average='binary')
print('\n"F_beta Score" (beta = 1/sqrt{5}) [test set]: %.4f' % fbeta_score_test)